{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0q8XR3P0GsP3"
   },
   "source": [
    "## Generate Responses and Embeddings with Dolly and Pythia on the Alpaca Dataset\n",
    "\n",
    "This notebook generates responses to prompts from the Alpaca dataset using Dolly and Pythia.\n",
    "\n",
    "Pythia is a family of decoder-only language models from [EleutherAI](https://www.eleuther.ai/). Check out their models on the Hugging Face Hub [here](https://huggingface.co/EleutherAI). Dolly is a version of Pythia that Databricks fine-tuned on an instruction-following dataset similar to Alpaca.\n",
    "\n",
    "Note: This notebook requires at least 12.1 GB of CPU memory (RAM). If you're using Colab, you'll need Colab Pro, with the runtime set to use a GPU and high RAM. The free version of Colab provides only 10 GB of RAM, which isn't enough.\n",
    "\n",
    "Let's get started. Install dependencies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q accelerate arize-phoenix datasets \"openai<1\" git+https://github.com/huggingface/transformers.git"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "aIU0H74jINnf"
   },
   "source": [
    "Import the libraries, then set your OpenAI API key in the cell that follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import datetime\n",
    "import locale\n",
    "import time\n",
    "import uuid\n",
    "\n",
    "import openai\n",
    "import pandas as pd\n",
    "import phoenix as px\n",
    "import torch\n",
    "from datasets import load_dataset\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "\n",
    "locale.getpreferredencoding = lambda: \"UTF-8\"  # This resolves a Colab bug that occurs sometimes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "openai.api_key = \"your key here\"\n",
    "assert openai.api_key != \"your key here\", \"Set your key\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kOjvZT8bDXMJ"
   },
   "source": [
    "Download a model (Dolly or Pythia) from Hugging Face and load it onto your device."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose a model by commenting and uncommenting the following lines\n",
    "model_type = \"databricks/dolly-v2-3b\"\n",
    "# model_type = \"EleutherAI/pythia-2.8b\"\n",
    "\n",
    "# Load the tokenizer\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_type, padding_side=\"left\")\n",
    "\n",
    "# Use the GPU if one is available, otherwise fall back to the CPU\n",
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
    "\n",
    "# Load the model weights\n",
    "model = AutoModelForCausalLM.from_pretrained(model_type, device_map=\"auto\")\n",
    "\n",
    "# Make sure the model is on the selected device\n",
    "model.to(device)"
   ]
  },
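  {
   "cell_type": "markdown",
   "metadata": {
    "id": "SiRA_0_KNbLr"
   },
   "source": [
    "Look up the special tokens that mark the start and end of a response. If the tokenizer defines them, generation is configured to stop at `### End`."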
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "RESPONSE_KEY = \"### Response:\"\n",
    "END_KEY = \"### End\"\n",
    "# Find the response key among the tokenizer's additional special tokens, if it is defined\n",
    "tokenizer_response_key = next(\n",
    "    (token for token in tokenizer.additional_special_tokens if token.startswith(RESPONSE_KEY)), None\n",
    ")\n",
    "\n",
    "generate_kwargs = {}\n",
    "response_key_token_id = None\n",
    "end_key_token_id = None\n",
    "if tokenizer_response_key:\n",
    "    response_key_ids = tokenizer.encode(tokenizer_response_key)\n",
    "    end_key_ids = tokenizer.encode(END_KEY)\n",
    "\n",
    "    # Each key must map to exactly one token ID; otherwise keep the default stopping behavior\n",
    "    if len(response_key_ids) == 1 and len(end_key_ids) == 1:\n",
    "        response_key_token_id = response_key_ids[0]\n",
    "        end_key_token_id = end_key_ids[0]\n",
    "\n",
    "        # Ensure generation stops once it generates \"### End\"\n",
    "        generate_kwargs[\"eos_token_id\"] = end_key_token_id"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "VbQBe6nkCxN9"
   },
   "source": [
    "Define a function that takes the full generated text, which may contain many paragraphs, and marks the sentence or paragraph we are embedding by capitalizing it and wrapping it in angle brackets.\n",
    "\n",
    "**Example:**\n",
    "\n",
    "\"This is an example response from an LLM. The response contains multiple sentences. <WE ARE EMBEDDING THIS SENTENCE, WHICH IS WHY IT IS CAPITALIZED.> Hopefully this is clear.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def capitalize_sentence_in_text(generated_text, target_sentence):\n",
    "    # Find the target sentence in the generated text\n",
    "    sentence_start = generated_text.find(target_sentence)\n",
    "\n",
    "    # Check if the target sentence is found in the generated text\n",
    "    if sentence_start == -1:\n",
    "        print(\"The target sentence was not found in the generated text.\")\n",
    "        return generated_text\n",
    "\n",
    "    sentence_end = sentence_start + len(target_sentence)\n",
    "\n",
    "    # Capitalize the target sentence\n",
    "    capitalized_sentence = generated_text[sentence_start:sentence_end].upper()\n",
    "\n",
    "    # Replace the target sentence with its capitalized version\n",
    "    capitalized_text = (\n",
    "        generated_text[:sentence_start]\n",
    "        + \"<\"\n",
    "        + capitalized_sentence\n",
    "        + \">\"\n",
    "        + generated_text[sentence_end:]\n",
    "    )\n",
    "\n",
    "    return capitalized_text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split a sequence of token IDs into paragraphs at two consecutive newline tokens.\n",
    "# Returns a list of lists, where each inner list holds the token IDs of one paragraph.\n",
    "def split_paragraphs(generated_ids, tokenizer):\n",
    "    # Define the newline token ID\n",
    "    newline_token_id = tokenizer.encode(\"\\n\")[0]\n",
    "\n",
    "    # Split the tokens into paragraphs\n",
    "    paragraphs = []\n",
    "    paragraph = []\n",
    "    newline_count = 0\n",
    "    total_tokens = 0\n",
    "    for token in generated_ids:\n",
    "        if token == newline_token_id:\n",
    "            newline_count += 1\n",
    "        else:\n",
    "            newline_count = 0\n",
    "        total_tokens += 1\n",
    "        paragraph.append(token)\n",
    "\n",
    "        if newline_count == 2:\n",
    "            paragraphs.append(paragraph)\n",
    "            paragraph = []\n",
    "            newline_count = 0\n",
    "\n",
    "    if paragraph:\n",
    "        paragraphs.append(paragraph)\n",
    "    print(\"Total Tokens\")\n",
    "    print(total_tokens)\n",
    "    return paragraphs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UnLiwcZlAVvv"
   },
   "source": [
    "Define a function to embed the prompt. The function takes the prompt text and returns the last-layer hidden states averaged over all prompt tokens."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_prompt_embedding(prompt, model):\n",
    "    # Tokenize the prompt (uses the notebook-level `tokenizer` and `device`)\n",
    "    print(prompt)\n",
    "    prompt_inputs = tokenizer(prompt, return_tensors=\"pt\")\n",
    "\n",
    "    # Move the inputs to the appropriate device\n",
    "    prompt_inputs = {k: v.to(device) for k, v in prompt_inputs.items()}\n",
    "\n",
    "    # Run a forward pass without tracking gradients, since only the activations are needed\n",
    "    with torch.no_grad():\n",
    "        prompt_output = model(**prompt_inputs, output_hidden_states=True)\n",
    "\n",
    "    # hidden_states is a tuple of tensors (the embedding output plus one per layer),\n",
    "    # each of shape [batch_size, sequence_length, hidden_size]\n",
    "    prompt_hidden_states = prompt_output.hidden_states\n",
    "\n",
    "    # Average the last layer over the sequence dimension to get one vector for the prompt\n",
    "    prompt_embedding = prompt_hidden_states[-1][0].detach().cpu().mean(dim=0)\n",
    "\n",
    "    # Convert the embedding tensor to a NumPy array\n",
    "    return prompt_embedding.numpy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hbM82veQApNg"
   },
   "source": [
    "Define a function that takes the generated token IDs and hidden states and returns a dataframe with one row per response paragraph. Each row holds the paragraph text and its embedding alongside the prompt text and prompt embedding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_conversation_embeddings(\n",
    "    prompt_len, generated_ids, hidden_states, model, tokenizer, prompt, prompt_category\n",
    "):\n",
    "    # Decode the full sequence (prompt plus response) and split the response tokens into paragraphs\n",
    "    generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)\n",
    "    paragraphs = split_paragraphs(generated_ids[prompt_len:], tokenizer)\n",
    "    print(\"Total Hidden Length\")\n",
    "    print(len(hidden_states))\n",
    "    print(\"Number of Paragraphs\")\n",
    "    print(len(paragraphs))\n",
    "\n",
    "    # Compute paragraph embeddings by averaging the last-layer hidden state of each token\n",
    "    paragraph_embeddings = []\n",
    "    paragraph_texts = []\n",
    "    capitalized_paragraphs = []\n",
    "    response_text = []\n",
    "    hidden_size = hidden_states[0][-1].shape[-1]\n",
    "    start = 0\n",
    "    for paragraph in paragraphs:\n",
    "        num_tokens = len(paragraph)\n",
    "        paragraph_text = tokenizer.decode(paragraph, skip_special_tokens=True)\n",
    "        print(\"Len of paragraph\")\n",
    "        print(num_tokens)\n",
    "\n",
    "        # Sum the last-layer hidden states over the tokens of this paragraph\n",
    "        sum_hidden_states = torch.zeros((hidden_size))\n",
    "        for i in range(num_tokens):\n",
    "            # hidden_states[step][layer] has shape [batch_size, seq_len, hidden_size].\n",
    "            # The first generation step covers the whole prompt, so take the last position,\n",
    "            # which is the state that produced the generated token.\n",
    "            sum_hidden_states += hidden_states[start + i][-1][0][-1].cpu()\n",
    "        # Divide by the number of tokens to get the average\n",
    "        paragraph_embedding = sum_hidden_states / num_tokens\n",
    "\n",
    "        # Only keep paragraphs longer than 10 tokens\n",
    "        if num_tokens > 10:\n",
    "            paragraph_embeddings.append(paragraph_embedding)\n",
    "            paragraph_texts.append(paragraph_text)\n",
    "            capitalized_paragraphs.append(\n",
    "                capitalize_sentence_in_text(generated_text, paragraph_text)\n",
    "            )\n",
    "            response_text.append(generated_text)\n",
    "\n",
    "        start += num_tokens\n",
    "\n",
    "    prompt_embedding = create_prompt_embedding(prompt, model)\n",
    "\n",
    "    # Convert the paragraph embeddings to NumPy arrays\n",
    "    cpu_paragraph_embeddings = [embedding.cpu().numpy() for embedding in paragraph_embeddings]\n",
    "\n",
    "    # One conversation ID shared by every paragraph of this response\n",
    "    uid = str(uuid.uuid4())[:20]\n",
    "    num_rows = len(paragraph_texts)\n",
    "\n",
    "    # Build one row per retained paragraph; conversation-level fields repeat on every row\n",
    "    data = {\n",
    "        \"conversation_id\": [uid] * num_rows,\n",
    "        \"prompt\": [prompt] * num_rows,\n",
    "        \"prompt_embedding\": [prompt_embedding] * num_rows,\n",
    "        \"response_paragraph\": paragraph_texts,\n",
    "        \"response_capitalized\": capitalized_paragraphs,\n",
    "        \"response_text\": response_text,\n",
    "        \"paragraph_embedding\": cpu_paragraph_embeddings,\n",
    "        \"prompt_category\": [prompt_category] * num_rows,\n",
    "    }\n",
    "    return pd.DataFrame(data)"
   ]
  },
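  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Empty dataframe that accumulates one row per response paragraph;\n",
    "# the columns match those returned by create_conversation_embeddings\n",
    "column_names = [\n",
    "    \"conversation_id\",\n",
    "    \"prompt\",\n",
    "    \"prompt_embedding\",\n",
    "    \"response_paragraph\",\n",
    "    \"response_capitalized\",\n",
    "    \"response_text\",\n",
    "    \"paragraph_embedding\",\n",
    "    \"prompt_category\",\n",
    "]\n",
    "df1 = pd.DataFrame(columns=column_names)"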
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9LslIwrUEe65"
   },
   "source": [
    "Load the Alpaca dataset and format prompts from the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "alpaca_df = load_dataset(\"tatsu-lab/alpaca\", split=\"train\").to_pandas()\n",
    "# Separate the sections with blank lines so each \"###\" header starts on its own line\n",
    "alpaca_df[\"prompt\"] = (\n",
    "    \"Below is an instruction that describes a task. Write a response that appropriately completes the request.\\n\\n\"\n",
    "    + \"### Instruction:\\n\"\n",
    "    + alpaca_df[\"instruction\"]\n",
    "    + \"\\n\\n### Input:\\n\"\n",
    "    + alpaca_df[\"input\"]\n",
    "    + \"\\n\\n### Response:\\n\"\n",
    ")\n",
    "alpaca_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_alpaca_df = alpaca_df[:60]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dfrjRkeWInuD"
   },
   "source": [
    "Iterate through each prompt, generate a response using the model, and create embeddings for the prompt and for each paragraph of the response."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for index, row in sample_alpaca_df.iterrows():\n",
    "    print(\"Index \" + str(index))\n",
    "\n",
    "    # Tokenize the prompt and move the inputs to the device\n",
    "    generated_inputs = tokenizer(row[\"prompt\"], return_tensors=\"pt\")\n",
    "    input_ids = generated_inputs[\"input_ids\"].to(device)\n",
    "    attention_mask = generated_inputs[\"attention_mask\"].to(device)\n",
    "    prompt_len = len(input_ids[0])\n",
    "\n",
    "    # max_length counts the prompt tokens as well as the generated tokens\n",
    "    model_data_output = model.generate(\n",
    "        input_ids,\n",
    "        do_sample=True,\n",
    "        temperature=0.9,\n",
    "        attention_mask=attention_mask,\n",
    "        max_length=250,\n",
    "        output_hidden_states=True,\n",
    "        return_dict_in_generate=True,\n",
    "        pad_token_id=tokenizer.pad_token_id,\n",
    "        **generate_kwargs,\n",
    "    )\n",
    "    generated_text = tokenizer.decode(model_data_output.sequences[0])\n",
    "    print(generated_text)\n",
    "    df_2 = create_conversation_embeddings(\n",
    "        prompt_len,\n",
    "        model_data_output.sequences[0],\n",
    "        model_data_output.hidden_states,\n",
    "        model,\n",
    "        tokenizer,\n",
    "        row[\"prompt\"],\n",
    "        \"\",\n",
    "    )\n",
    "    # Append the rows for this response to the running DataFrame\n",
    "    df1 = pd.concat([df1, df_2], ignore_index=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "AGmowEtGJXKl"
   },
   "source": [
    "Evaluate each prompt-response pair with a call to the OpenAI API. Each response paragraph is scored against its prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def evaluate_question_answer_pair(question, answer):\n",
    "    prompt = f\"You are an evaluation model that scores how accurately an answer responds to a question. Score the answer from 0 to 1, where 0 is the worst and 1 is the best. Question:\\n{question}\\nAnswer:\\n{answer}\\nReturn only a number from 0 to 1:\"\n",
    "\n",
    "    response = openai.ChatCompletion.create(\n",
    "        model=\"gpt-4\",\n",
    "        messages=[{\"role\": \"user\", \"content\": prompt}],\n",
    "    )\n",
    "    print(prompt)\n",
    "    print(response.choices[0])\n",
    "    time.sleep(2)  # avoid rate-limiting from the OpenAI API\n",
    "    gpt_response = response.choices[0].message.content.strip()\n",
    "    return gpt_response\n",
    "\n",
    "\n",
    "def evaluate_response_for_dataframe(df):\n",
    "    df[\"evals\"] = df.apply(\n",
    "        lambda row: evaluate_question_answer_pair(row[\"prompt\"], row[\"response_paragraph\"]), axis=1\n",
    "    )\n",
    "    return df\n",
    "\n",
    "\n",
    "post_eval_df = evaluate_response_for_dataframe(df1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "whfO9te4J4jY"
   },
   "source": [
    "Convert the `evals` column to float, replacing non-numeric values with 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "post_eval_df[\"evals\"] = pd.to_numeric(post_eval_df[\"evals\"], errors=\"coerce\").fillna(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PqnSs2VAJz2H"
   },
   "source": [
    "Calculate the mean evaluation score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "post_eval_df[\"evals\"].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "28ZnaI-rHoqs"
   },
   "source": [
    "Save a copy of the dataframe to a timestamped CSV file, with the embeddings serialized as strings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get a formatted timestamp\n",
    "now = datetime.datetime.now()\n",
    "timestamp = now.strftime(\"%Y-%m-%d_%H-%M-%S\")\n",
    "\n",
    "save_df = post_eval_df.copy()\n",
    "save_df[\"prompt_embedding_vec\"] = save_df[\"prompt_embedding\"].apply(lambda x: str(x.tolist()))\n",
    "save_df[\"paragraph_embedding_vec\"] = save_df[\"paragraph_embedding\"].apply(lambda x: str(x.tolist()))\n",
    "\n",
    "# Create the file name with date and time and save\n",
    "file_name = f'{model_type.split(\"/\")[1]}_{timestamp}'\n",
    "file_name += \".csv\"\n",
    "save_df.to_csv(file_name, index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BpQPtxURHLy_"
   },
   "source": [
    "Optionally launch Phoenix to visualize your embedding data and get a sanity check."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "schema = px.Schema(\n",
    "    prompt_column_names=px.EmbeddingColumnNames(\n",
    "        raw_data_column_name=\"prompt\", vector_column_name=\"prompt_embedding\"\n",
    "    ),\n",
    "    response_column_names=px.EmbeddingColumnNames(\n",
    "        raw_data_column_name=\"response_paragraph\", vector_column_name=\"paragraph_embedding\"\n",
    "    ),\n",
    "    tag_column_names=[\n",
    "        \"prompt_category\",\n",
    "        \"conversation_id\",\n",
    "        \"response_capitalized\",\n",
    "        \"response_text\",\n",
    "    ],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_name = model_type.split(\"/\")[1]\n",
    "ds = px.Dataset(dataframe=post_eval_df, schema=schema, name=model_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "session = px.launch_app(ds)"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
